About the project

This is the first time I have opened RStudio, so I am feeling a bit confused. But I am sure that this environment will become more familiar during this course. I think that this course will be interesting since it seems to be a course where you learn by doing. The course is also about two really contemporary subjects: open data and open science. I have at least a few expectations for this course.

I found out about this course from Weboodi when I was planning what courses I should take this semester.

Here’s a link to my github:

https://github.com/mikkohyy/IODS-project

Here’s another way of creating a link with R Markdown.


Chapter 2: Regression and model validation

Setting up

library(GGally)  
library(ggplot2)  
setwd("data/")  
learning_data <- read.csv(file="learning2014.csv")  


str(learning_data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning_data)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21

This dataset consists of 7 variables and 166 rows, i.e. data about 166 students. The variables can be classified into three groups: 1) background information (gender, age, attitude towards statistics), 2) the learning style of the student (deep, strategic, surface), and 3) how well the student did on the exam (points). The data was collected from students of a statistics course.


ggpairs(learning_data,mapping=aes(col=gender,alpha=0.3),lower=list(combo=wrap("facethist",bins=20)),upper=list(continuous=wrap("cor",size=3)))  

summary(learning_data)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The students are quite young (median age 22) and there are more females (110) than males (56). The learning styles of the students show more characteristics of “deep learning” (mean 3.68) than “surface learning” (mean 2.787). The strongest correlation is between attitude and points (0.437). Other interesting, but weaker, correlations are deep & attitude, stra & age, stra & points, and surf & points (negative).

Surf is interesting since it has a negative correlation with various variables. But this is also quite self-evident since, for example, deep learning and surface learning are opposites. There are also some interesting correlations related to gender - e.g. for males, age has a negative correlation with points.

I think that the general impression is that attitude has a strong correlation with points and surface learning style correlates negatively with various variables.
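The correlations discussed above can also be read from a plain correlation matrix, which is sometimes easier to scan than the ggpairs panel (a small sketch using the learning_data frame loaded above; cor() is base R):

```r
# Correlation matrix of the numeric variables (drop the gender factor in column 1)
cor_matrix <- round(cor(learning_data[, -1]), 3)
cor_matrix[, "points"]  # correlation of every variable with the exam points
```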


I build my model on the assumption that “a good attitude” together with a deep (rather than surface) learning style will lead to good results in the exam. Seems reasonable, right? Hence, I will use attitude, deep and surf in my model:

model <- lm(points~attitude + deep + surf, data=learning_data)
summary(model)
## 
## Call:
## lm(formula = points ~ attitude + deep + surf, data = learning_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9168  -3.1487   0.3667   3.8326  11.3519 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3551     4.7124   3.895 0.000143 ***
## attitude      3.4661     0.5766   6.011 1.18e-08 ***
## deep         -0.9485     0.7903  -1.200 0.231815    
## surf         -1.0911     0.8360  -1.305 0.193669    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.313 on 162 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1876 
## F-statistic:  13.7 on 3 and 162 DF,  p-value: 5.217e-08

From the summary of the model we can see that only attitude can be considered a statistically highly significant (p < 0.001, or ***) variable in this case. In other words, it is very unlikely that we would observe an effect this large if attitude had no real relationship with the target variable. Hence, we can conclude that there is a statistical relationship between attitude and points.
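The p-values can also be pulled out of the fitted model object programmatically instead of reading them from the printed table (a sketch; `coef(summary(...))` is the standard way to get the coefficient matrix of an lm fit):

```r
# Matrix of estimates, standard errors, t-values and p-values
coefs <- coef(summary(model))
coefs[, "Pr(>|t|)"]                      # p-value of each term
coefs["attitude", "Pr(>|t|)"] < 0.001    # attitude is the only predictor below 0.001
```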

Since deep and surf were not statistically significant, we change our model. Our new model is simple and has only attitude as an explanatory variable:

new_model <- lm(points~attitude, data=learning_data)


summary(new_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

What this model tells us is that a high value on attitude seems to increase the points a student gets from the exam. In other words, every additional point on attitude increases the exam points by an average of 3.53. This is quite logical, since I think that scoring high on attitude implies that a student is motivated.
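Concretely, the fitted line is points ≈ 11.64 + 3.53 × attitude, so the model’s prediction for a student can be computed with predict() (the attitude value 4 here is just a made-up illustration, not a case from the data):

```r
# Predicted exam points for a hypothetical student with attitude 4
predict(new_model, newdata = data.frame(attitude = 4))
# Same thing by hand from the printed coefficients:
11.6372 + 3.5255 * 4   # about 25.7 points
```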

The multiple R-squared basically describes how well “the model fits the set of observations” (this was the definition in the blog). In this case, it tells how much of the variation in the points is explained by attitude alone. This model’s multiple R-squared is 0.1906, which means that roughly 19% of the variation is explained by attitude.
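That number can be reproduced by hand from the definition R² = 1 − SS_res/SS_tot (a sketch using the new_model object fitted above):

```r
y <- learning_data$points
ss_res <- sum(residuals(new_model)^2)    # residual sum of squares
ss_tot <- sum((y - mean(y))^2)           # total sum of squares
1 - ss_res / ss_tot                      # should match the printed 0.1906
```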


There are some assumptions behind regression models. Naturally, one of them is linearity. Another is that the errors are normally distributed. It is possible to analyze the validity of these assumptions by analyzing the residuals of the model. This is done by asking if the following assumptions hold: 1) the errors of the model are normally distributed, 2) the size of the errors does not depend on the explanatory variables (i.e. the errors have constant variance), and 3) the errors are not correlated.

par(mfrow=c(2,2))
plot(new_model,which=c(2,1,5)) # QQ-plot, Residuals vs Fitted, Residuals vs Leverage

Residuals vs Fitted values

The Residuals vs Fitted plot makes it possible to explore if there are problems with the constant variance assumption. Since there does not seem to be any pattern in the scatter plot, this assumption does not seem to be violated.

Normal QQ-plot

The normal QQ-plot makes it possible to explore if there are problems with the assumption that the errors are normally distributed. It seems that “the dots fit the line” relatively well, although there are some anomalies at both ends. Still, I would conclude that there are no problems with this assumption.

Residuals vs Leverage

Residuals vs Leverage makes it possible to explore whether some observations have a high impact on the model - e.g. a value with very high leverage makes the model less reliable. In my model, nothing “pulls” the regression line: no single value has a high leverage.
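Leverage and influence can also be checked numerically rather than from the plot (a sketch using base R’s diagnostics; hatvalues() returns the leverages and cooks.distance() combines leverage with residual size):

```r
# Largest leverages: a common rule of thumb flags points above 2 * p / n
lev <- hatvalues(new_model)
head(sort(lev, decreasing = TRUE), 3)
# Cook's distances near or above 1 would be worrying
max(cooks.distance(new_model))
```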


Chapter 3: Logistic regression

library(ggplot2)
library(dplyr)
library(boot)
setwd("data/")
alc_data <- read.csv("part3_alc_data.csv")


names(alc_data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc_data)
## [1] 382  35

The data is about students in “secondary education of two Portuguese schools”. In this data set there are 382 students and 35 variables. The variables describe various attributes such as information about the student’s home, their after-school activities and how well they do in school. The data set was collected “using school reports and questionnaires”. The ages of the students are mostly between 15 and 18 (although a few are slightly older).

The data used in this exercise has been combined from two different data sets (the first about performance in math and the second about performance in Portuguese language). The students who appeared in both of those data sets were selected for this data set. Numeric variables were combined by taking the average over the two data sets - i.e. the grades in this data set are the average of the math and Portuguese grades. Hence, they can be claimed to indicate how well the student did in general.

(As a side note, the data I am using (which I wrangled myself) is different from the data that is linked in the analysis assessment. The difference is in the grades. It seems that in the pre-wrangled data the grades are the math grades only (I understood that they should be the average of math and por). So my results may differ from those that are based on the pre-wrangled data.)
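The join-and-average step described above can be sketched roughly like this (a hypothetical sketch: the object names math and por and the exact join columns are assumptions for illustration, not my actual wrangling script):

```r
library(dplyr)

# Hypothetical sketch: match students appearing in both datasets on their
# background columns, then average the overlapping numeric columns
join_cols <- c("school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")
math_por <- inner_join(math, por, by = join_cols, suffix = c(".math", ".por"))
# e.g. the final grade becomes the rounded mean of the two course grades
math_por <- mutate(math_por, G3 = round((G3.math + G3.por) / 2))
```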


The variables I chose were:

  • G3 (final grade): I go with the very traditional assumption that if you get good grades in school (especially if you are 15-18 years old) you probably will not be one of those who use lots of alcohol. Hence, the higher the grade, the less likely the student has “high_use”.
  • goout (going out with friends): The more a student goes out with friends, the more chances they have to consume alcohol. Hence, the more you go out with your friends, the more likely it is that you have “high_use”. And there is also always peer pressure.
  • studytime (weekly study time): If a student uses a lot of time to study, it indicates that they take their school/future seriously and probably will not be one of those who consume alcohol in high quantities. Another possibility is that they simply do not have time to be in environments where alcohol is usually consumed. Hence, the more time a student spends studying, the less likely it is that they have a high alcohol consumption.
  • internet (Internet access at home): I thought that asking whether having Internet access at home predicts anything about alcohol consumption would be interesting. My hypothesis is that if you have internet at home, you do not have time for alcohol since you are playing Fortnite.


summary(select(alc_data, G3, goout, studytime, internet))
##        G3            goout         studytime     internet 
##  Min.   : 0.00   Min.   :1.000   Min.   :1.000   no : 58  
##  1st Qu.:10.00   1st Qu.:2.000   1st Qu.:1.000   yes:324  
##  Median :12.00   Median :3.000   Median :2.000            
##  Mean   :11.46   Mean   :3.113   Mean   :2.037            
##  3rd Qu.:14.00   3rd Qu.:4.000   3rd Qu.:2.000            
##  Max.   :18.00   Max.   :5.000   Max.   :4.000

It seems that the students are fairly outgoing (mean 3.113) and most of them have internet at home. The grades seem to be skewed towards higher grades, and the students do not seem to be that excited about spending their time studying (the mean is 2.037).

Box and bar plots

par(mar=c(4,4,0.2,0.1))
g.g3 <- ggplot(data=alc_data,aes(x=high_use,y=G3,fill=high_use))
g.g3 + geom_boxplot() + ylab("Final grade")
g.going_out <- ggplot(data=alc_data,aes(x=goout,fill=high_use))
g.going_out + geom_bar() + ylab("Going out")
g.studytime <- ggplot(data=alc_data,aes(x=studytime,fill=high_use))
g.studytime + geom_bar()
g.internet <- ggplot(data=alc_data,aes(x=internet,fill=high_use))
g.internet + geom_bar()

There are at least some differences in grades in relation to high_use - e.g. the median is a bit higher for the low-use group. From the going out barplot we can see that when goout > 3, the high_use group becomes much bigger relative to the non-high_use group. For internet, studytime and goout we get more insight with cross-tables:

t.internet <- table(alc_data$internet,alc_data$high_use,dnn=c("internet","high_alc"))
addmargins(round(prop.table(t.internet)*100,1)) 
##         high_alc
## internet FALSE  TRUE   Sum
##      no   11.3   3.9  15.2
##      yes  58.9  25.9  84.8
##      Sum  70.2  29.8 100.0
t.studytime <- table(alc_data$studytime,alc_data$high_use,dnn=c("studytime","high_alc"))
addmargins(round(prop.table(t.studytime)*100,1))
##          high_alc
## studytime FALSE TRUE  Sum
##       1    15.2 11.0 26.2
##       2    35.3 15.7 51.0
##       3    13.6  2.1 15.7
##       4     6.0  1.0  7.0
##       Sum  70.1 29.8 99.9
t.goout <- table(alc_data$goout,alc_data$high_use,dnn=c("goout","high_alc"))
addmargins(round(prop.table(t.goout)*100,1))
##      high_alc
## goout FALSE  TRUE   Sum
##   1     5.0   0.8   5.8
##   2    22.0   4.2  26.2
##   3    27.0   6.0  33.0
##   4    10.7  10.5  21.2
##   5     5.5   8.4  13.9
##   Sum  70.2  29.9 100.1

When compared to my hypotheses, these explorations lead to the following conclusions:

  • The higher the grade, the less likely the student has ‘high_use’
    • It seems that those who were classified as low alcohol users had somewhat better final grades (but the spread among those without high use was larger).
  • The more you go out with friends, the more likely it is that you will have ‘high_use’
    • It seems that those who go out more probably also have “high_use”.
  • If the student uses a lot of time to study, they do not have time/interest to consume alcohol. Therefore, they probably do not have ‘high_use’.
    • I would not say that this was disproved. Those who use more time studying probably also have a lower alcohol consumption.
  • If you have internet at home, you do not have time for alcohol since you are playing Fortnite
    • My hypothesis seems to be wrong.


the_model <- glm(high_use~G3+goout+studytime+internet,data=alc_data, family="binomial")
summary(the_model)
## 
## Call:
## glm(formula = high_use ~ G3 + goout + studytime + internet, family = "binomial", 
##     data = alc_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7427  -0.7849  -0.5479   0.9523   2.5187  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.80189    0.69364  -2.598 0.009385 ** 
## G3          -0.03626    0.03875  -0.936 0.349408    
## goout        0.72858    0.11802   6.173 6.68e-10 ***
## studytime   -0.57985    0.16616  -3.490 0.000483 ***
## internetyes  0.11896    0.35473   0.335 0.737368    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 399.89  on 377  degrees of freedom
## AIC: 409.89
## 
## Number of Fisher Scoring iterations: 4

What I find surprising is that G3 (the final grade) is not significant in relation to high_use. But goout and studytime are significant. And as I thought, having internet at home does not have any significance. From the model we can say that going out has a positive connection with high_use (i.e. an increase in goout increases the odds of high_use being true) and studytime has a negative one. G3 seems to have a really small negative connection, and having internet at home a positive one (i.e. having internet increases the odds of high_use) - but neither of them is significant.

The difference between the null deviance (465.68) and the residual deviance (399.89) basically describes how well the model fits. In the case of this model the difference is 65.79, which indicates that this model is not bad.
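That drop in deviance can also be tested formally: under the null hypothesis, the difference follows a chi-squared distribution with degrees of freedom equal to the number of predictors added (a sketch using the figures printed above):

```r
# Null deviance 465.68 on 381 df, residual deviance 399.89 on 377 df
dev_drop <- 465.68 - 399.89                            # 65.79
pchisq(dev_drop, df = 381 - 377, lower.tail = FALSE)   # p-value for the model as a whole
```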

Odds ratios and confidence intervals:

oddr <- coef(the_model) %>% exp
confinterv <- confint(the_model) %>% exp
cbind(oddr,confinterv)
##                  oddr     2.5 %    97.5 %
## (Intercept) 0.1649873 0.0409081 0.6261624
## G3          0.9643898 0.8936694 1.0407265
## goout       2.0721458 1.6539622 2.6296307
## studytime   0.5599823 0.3999663 0.7686348
## internetyes 1.1263211 0.5709167 2.3115264

When we look at the oddr column we see how the variables affect the odds of having high_use (in this context, > 1 indicates a positive relationship). Goout has a relatively large positive relationship - it implies that when goout increases by 1, the odds of the student having high_use roughly double (odds ratio ≈ 2.07). The confidence intervals suggest that we can be fairly sure that goout has a positive connection and studytime a negative one. But, for example, internet at home is quite useless as a predictor, since its confidence interval is wide and includes 1.

My hypothesis on internet was proved quite wrong if we look at the model (it has a positive connection). Another surprising thing is that G3 barely has a connection (only about -0.04) with alcohol use. Therefore, my hypothesis based on traditional thinking was false. Goout and studytime had the roles I thought they would. The next question is how well my model predicts.


Graph:

better_model <- glm(high_use~goout+studytime,data=alc_data, family="binomial")
probs <- predict(better_model, type="response")
alc_data <- mutate(alc_data,probability=probs)
alc_data <- mutate(alc_data, prediction=probability>0.5)
plot <- ggplot(alc_data,aes(x=probability,y=high_use,col=prediction))
plot + geom_point()

Crosstable:

pred.table <- table(high_use=alc_data$high_use,prediction=alc_data$prediction)
addmargins(prop.table(pred.table))
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64659686 0.05497382 0.70157068
##    TRUE  0.19109948 0.10732984 0.29842932
##    Sum   0.83769634 0.16230366 1.00000000

Penalty:

lossf <- function(class, prob) {
  wrong <- abs(class - prob) > 0.5
  mean(wrong)
}
lossf(class=alc_data$high_use,prob=alc_data$probability)
## [1] 0.2460733

This model’s penalty is approximately 0.25, which means that it predicted about 75% of the cases right. It seems that my model has difficulties predicting the high_use cases: it predicted 19% of all cases wrong by claiming they did not have high_use when they actually did.
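The same point can be made with the sensitivity and specificity computed from the proportions in the cross-table above:

```r
# Proportions taken from the prediction table above
tp <- 0.10732984; fn <- 0.19109948   # high_use cases predicted TRUE / FALSE
tn <- 0.64659686; fp <- 0.05497382   # non-high_use cases predicted FALSE / TRUE
tp / (tp + fn)   # sensitivity ~ 0.36: only about a third of high_use cases caught
tn / (tn + fp)   # specificity ~ 0.92: non-users almost always classified right
```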

I think that random guessing would be a straightforward and simple guessing strategy:

guess <- sample(c(TRUE,FALSE),size=382,replace=TRUE)
wrong <- alc_data$high_use != guess # a guess is wrong when it differs from the truth
mean(wrong)
## [1] 0.4502618

I ran this a few times and the penalty was always somewhere between 0.45 and 0.55. Hence, it seems that my model predicts better than just guessing randomly.
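Another naive baseline is to always predict the majority class (FALSE): its penalty equals the share of high_use students, about 0.298, so the model beats that strategy too (a sketch using the alc_data frame loaded above):

```r
# Always guessing FALSE is wrong exactly when high_use is TRUE
baseline_penalty <- mean(alc_data$high_use)   # share of TRUE cases, ~ 0.298
baseline_penalty > 0.2460733                  # the model's penalty is lower
```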


crossv <- cv.glm(alc_data,cost=lossf,glmfit=better_model,K=10)
crossv$delta[1]
## [1] 0.2460733

According to cross-validation, my model seems to be a bit better than the one in DataCamp (an error of ~0.25 is slightly better than an error of ~0.26 - but only slightly).


At the beginning there is a model with 10 variables. After each “round” one variable is removed until we have the same model (goout and studytime) that was built in the earlier sections. Below is the code for the first round. For the second round I removed traveltime, and I continued like this until there were only two variables left in the model.

massive_model <-  glm(high_use~goout+studytime+Fedu+Fjob+Medu+Mjob+freetime+romantic+sex+traveltime,data=alc_data, family="binomial")
crossv <- cv.glm(alc_data,cost=lossf,glmfit=massive_model,K=10)
penalty <- lossf(class=alc_data$high_use,prob=predict(massive_model,type="response")) # training error of this round's model
results <- matrix(nrow=9,ncol=3)
results[1,] <- c(10,penalty,crossv$delta[1])

# Matrix of each round:
results
##       [,1]      [,2]      [,3]
##  [1,]   10 0.2094241 0.2277487
##  [2,]    9 0.2146597 0.2303665
##  [3,]    8 0.2251309 0.2486911
##  [4,]    7 0.2277487 0.2329843
##  [5,]    6 0.2146597 0.2460733
##  [6,]    5 0.2277487 0.2434555
##  [7,]    4 0.2146597 0.2356021
##  [8,]    3 0.2460733 0.2539267
##  [9,]    2 0.2460733 0.2460733
results.df <- as.data.frame(results)
ggplot(data=results.df,aes(x=V1)) + geom_line(aes(y=V2,colour="V2")) + geom_point(aes(y=V2,colour="V2")) + geom_line(aes(y=V3,colour="V3")) + geom_point(aes(y=V3,colour="V3")) + scale_x_continuous(trans = "reverse", breaks = unique(results.df$V1)) + ylab("Error") + xlab("Number of predictors")

The red line (V2) shows the training error produced by the loss function. The blue line (V3) shows the cross-validation error produced by cv.glm.
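The manual round-by-round removal described above could also be automated with a loop; a rough, hypothetical sketch (the removal order here is only illustrative, not necessarily the order I actually used):

```r
# Drop one predictor per round and record the training and CV errors
vars <- c("goout", "studytime", "Fedu", "Fjob", "Medu", "Mjob",
          "freetime", "romantic", "sex", "traveltime")
res <- matrix(nrow = length(vars) - 1, ncol = 3)
for (i in seq(length(vars), 2)) {
  f <- reformulate(vars[1:i], response = "high_use")   # build the formula
  m <- glm(f, data = alc_data, family = "binomial")
  train_err <- lossf(class = alc_data$high_use, prob = predict(m, type = "response"))
  cv_err <- cv.glm(alc_data, cost = lossf, glmfit = m, K = 10)$delta[1]
  res[length(vars) - i + 1, ] <- c(i, train_err, cv_err)
}
```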


Chapter 4: Clustering and classification

library(MASS)
library(GGally)  
library(ggplot2)
library(tidyverse)
library(corrplot)
data(Boston)


dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The Boston dataset comes from the MASS package (a supplementary package to a book about applied statistics). It has 506 observations (rows) and 14 variables. Its content is described as “Housing Values in Suburbs of Boston”. The variables are quite heterogeneous: some examples are the crime rate by town, the nitrogen oxides concentration and the pupil-teacher ratio by town. All variables are numeric (although chas is a dummy variable).


First, let’s look at the data:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

These summaries tell us that the scales of the variables are quite different from each other. There are variables with a maximum of 8.780 (rm) and variables with a maximum of 711.0 (tax). Some of the variables also have a somewhat odd structure - e.g. the crime rate’s (crim) median is 0.26 but its maximum is ~88.98.

Then a plot that is made with ggpairs (I adjusted the theme a bit to make it more readable):

ggpairs(Boston,upper=list(continuous=wrap("cor",size=2.2)), lower=list(continuous=wrap("points",size=0.5))) + theme(axis.text=element_text(size=5),panel.grid.major=element_blank(),axis.ticks=element_blank(),panel.border=element_rect(linetype="solid", colour="black",fill=NA),strip.text=element_text(size=7))

Only a few variables are somewhat normally distributed (e.g. rm, the average number of rooms per dwelling). Most of the variables are skewed and/or bimodal. From the scatter plots we can find at least one clear (negative) correlation: between the median value of homes (medv) and the lower-status percentage (lstat). There are also some variables whose values sit mostly at the “lower” end of the scale (e.g. crim & chas). The correlations are probably much easier to read from a correlation matrix:

cor_mtx <- round(cor(Boston),2)
corrplot(cor_mtx,type="upper",tl.cex=0.7)

There are some negative correlations, especially involving dis (distance from Boston employment centres) - one of them is a negative correlation with age. Tax has a strong positive correlation with various variables - e.g. rad (accessibility to radial highways). But at this point it is difficult to say much about this data, since there do not seem to be any clear patterns.


Since the scales of the variables were so different (and that is bad for clustering), we need to scale the data:

boston_sc <- scale(Boston)
boston_sc <- as.data.frame(boston_sc) # the scale() function outputs a matrix, so we transform it back into a data frame
summary(boston_sc)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling changed the data so that the mean of every variable is 0. It also made the variables resemble each other more - for example, tax ranged from 187.0 to 711.0, and after scaling it runs from -1.3127 to 1.7964. This step makes sense if we are interested in the distances between the cases described in this dataset.
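What scale() does can be verified by hand: each column is centred by its mean and divided by its standard deviation:

```r
# Manual standardization of one column should match the scale() output
manual_tax <- (Boston$tax - mean(Boston$tax)) / sd(Boston$tax)
all.equal(manual_tax, boston_sc$tax)   # TRUE
```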

Then we will create a categorical variable called crime, based on the quantiles of the crim variable:

brks <- quantile(boston_sc$crim)
lbls <- c("low","med_low","med_high","high")
crime <- cut(boston_sc$crim,breaks=brks,label=lbls,include.lowest=TRUE)
boston_sc$crime <- crime
boston_sc <- boston_sc[,-1] # Remove the old crim variable
summary(boston_sc$crime)
##      low  med_low med_high     high 
##      127      126      126      127

Then let’s divide the data into training (80%) and testing sets (20%):

n <- nrow(boston_sc)
ind <- sample(n, size=n*0.8)
train_set <- boston_sc[ind,]
test_set <- boston_sc[-ind,]

Now we have two sets. First, train_set contains a randomly chosen 80% of the cases of the Boston data. Second, test_set contains the remaining 20% of the cases.


First, we will do a linear discriminant analysis on the training set (train_set). The categorical crime variable is the target variable and the rest of the variables are used as predictors:

boston_lda_train <- lda(crime~., data=train_set)
boston_lda_train
## Call:
## lda(crime ~ ., data = train_set)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2524752 0.2425743 0.2500000 0.2549505 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.95208643 -0.9132918 -0.117932980 -0.8887073  0.4486070
## med_low  -0.05431237 -0.3610279 -0.031282114 -0.5624435 -0.1282547
## med_high -0.40063686  0.1538981  0.195445218  0.2736582  0.0510387
## high     -0.48724019  1.0170891 -0.004759149  1.0971906 -0.3705334
##                 age        dis        rad        tax    ptratio      black
## low      -0.8585796  0.8787159 -0.7026361 -0.7302860 -0.4889152  0.3810803
## med_low  -0.3236222  0.3957287 -0.5388911 -0.5059893 -0.1048350  0.3131862
## med_high  0.3477041 -0.3140636 -0.4099119 -0.3164066 -0.1335816  0.1643627
## high      0.8302549 -0.8665006  1.6384176  1.5142626  0.7811136 -0.6939166
##                lstat         medv
## low      -0.75289852  0.509808550
## med_low  -0.12657496 -0.007894033
## med_high  0.03080859  0.129707812
## high      0.85409848 -0.679803527
## 
## Coefficients of linear discriminants:
##                 LD1          LD2        LD3
## zn       0.13462134  0.694407366 -0.9048391
## indus    0.01380803 -0.323627271  0.2928125
## chas    -0.07898923 -0.053952603  0.1064874
## nox      0.44241711 -0.757329976 -1.4778868
## rm      -0.13653331 -0.070846311 -0.2390458
## age      0.26035995 -0.183021366 -0.1188057
## dis     -0.04918220 -0.216710665  0.2046343
## rad      3.09226708  1.054402382  0.2090547
## tax      0.04579333  0.005511385  0.3974043
## ptratio  0.15625987 -0.122292850 -0.3072610
## black   -0.15145679  0.031789763  0.1118318
## lstat    0.21347831 -0.300551368  0.3808175
## medv     0.21911727 -0.457791111 -0.2050390
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9560 0.0344 0.0096

Then a (bi)plot of the LDA, with colors based on the crime rate (for some reason pch=crime_as_n did not work in this plot):

crime_as_n <- as.numeric(train_set$crime)
plot(boston_lda_train,dimen=2,col=crime_as_n)

The end result looks a bit different from the plot in the Data Camp exercise (I do not know whether this is a bad thing, and if it is, how bad).

6

Then we will test how well the model we created works on data it has not seen before (test_set). The first thing to do is to remove the “right answers” from the test_set (crime is column number 14):

correct_from_test <- test_set[,14]
test_set <- test_set[,-14]

Then we will use the LDA model to predict the crime rate classes of the cases in the test_set and illustrate the results with a cross-table:

boston_pred <- predict(boston_lda_train,newdata=test_set)
addmargins(table(correct=correct_from_test,predicted=boston_pred$class))
##           predicted
## correct    low med_low med_high high Sum
##   low       18       4        3    0  25
##   med_low    2      17        9    0  28
##   med_high   0       3       21    1  25
##   high       0       0        0   24  24
##   Sum       20      24       33   25 102

It seems that this model was quite good at classifying the cases from the test_set: 80 of the 102 cases (about 78%) were placed in the right “basket of crime rate”. The weakest class was med_low, where it only managed to get 17/28 right. I guess this could be considered a relatively good success rate.
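The success rate comes straight from the cross-table above: the diagonal holds the correct classifications, so accuracy is the diagonal sum over the total. A small check with the same counts:

```r
# Overall accuracy of a classifier from its confusion table:
# sum of the diagonal (correct cases) divided by the total number of cases
conf <- matrix(c(18,  4,  3,  0,
                  2, 17,  9,  0,
                  0,  3, 21,  1,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low","med_low","med_high","high"),
                               predicted = c("low","med_low","med_high","high")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy, 3)  # 0.784
```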

7

Re-loading and standardizing the Boston dataset:

data(Boston)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)

First we will calculate the distances between the observations, using the most basic method, the Euclidean distance:

boston_euc_dis <- dist(boston_again, method="euclidean")
summary(boston_euc_dis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
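As a sanity check on what `dist()` computes, the Euclidean distance is the square root of the summed squared differences between two rows; a classic 3-4-5 triangle makes this easy to verify:

```r
# dist(..., method="euclidean") = sqrt(sum of squared coordinate differences)
m <- rbind(a = c(0, 0),
           b = c(3, 4))
as.numeric(dist(m, method = "euclidean"))  # sqrt(3^2 + 4^2) = 5
```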

For evaluating the best number of clusters I will use the function that was introduced in the Data Camp exercise (the maximum number of clusters in this evaluation is 10):

set.seed(22)
cluster_n <- 10
wcss <- sapply(1:cluster_n, function(k){kmeans(boston_again,k)$tot.withinss})
qplot(x=1:cluster_n,y=wcss,geom="line")

I think that the best number of clusters is 3, since the total within-cluster sum of squares does not decrease much after that point. Another possible number could be 5, but when used in a visualization the result is quite messy. Hence, I will go with the classic approach of “less is more”. Next is the visualization of the clusters when the number of clusters is 3:

boston_km <- kmeans(boston_again,centers=3)
pairs(boston_again,col=boston_km$cluster,cex=0.2)

It seems that the “black cluster” jumps out in each plot (of course, it could also be that its color is simply more visible in the small graphs). Three variables seem to have (in most cases) clear clusters in them: black, crim and lstat. My interpretation is that these variables have an important role in the model and classification - i.e. the clusters are heavily influenced by them. The cluster colored black also seems to have “a strong identity”, since it is clearly visible in most of the plots and is thus clearly different from the other clusters.

Bonus

I will perform the k-means with four clusters (the arrow function I use is the same one as in the Data Camp exercise):

data(Boston)
set.seed(22)
boston_again <- scale(Boston)
boston_again <- as.data.frame(boston_again)
boston_km <- kmeans(boston_again,centers=4)
boston_again$cluster <- boston_km$cluster
boston_lda <- lda(cluster~., data=boston_again)
plot(boston_lda, dimen=2, col=boston_again$cluster)
lda.arrows(boston_lda, myscale=2)

It seems that the most influential individual variable is black. Other variables with a relatively strong influence are crim, indus, nox and tax. The variables black and crim seem to “pull” in their own directions, while most of the other variables form a group that “pulls” to the left. The fact that black seems to be influential confirms earlier observations (e.g. part 7 of this exercise).

Super-bonus:

model_predictors <- dplyr::select(train_set, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(boston_lda_train$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% boston_lda_train$scaling
matrix_product <- as.data.frame(matrix_product)

# install.packages("plotly")
library(plotly)

plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=crime_as_n, size=I(40))

# cluster_col: k-means cluster assignments of the training cases
# (not defined earlier in this chapter, so it is created here)
cluster_col <- kmeans(model_predictors, centers=4)$cluster
plot_ly(x=matrix_product$LD1, y=matrix_product$LD2, z=matrix_product$LD3, type="scatter3d", mode="markers", color=cluster_col, size=I(40))

The point clouds in these two plots resemble each other relatively well. Based on them, I would say that the crime variable plays a relatively large role when the algorithm defines the clusters, since the cluster colors are scattered quite similarly to the crime rate colors.
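The matrix multiplication above is the general way of projecting observations onto discriminant axes: data matrix times the scaling (coefficient) matrix. A small self-contained sketch of the same idea with lda() on the built-in iris data:

```r
library(MASS)  # provides lda()

# Fit an LDA and project the raw predictors onto the discriminant axes
# by multiplying the data matrix with the scaling matrix
fit    <- lda(Species ~ ., data = iris)
scores <- as.matrix(iris[, 1:4]) %*% fit$scaling
dim(scores)  # 150 observations x 2 discriminants (3 classes give 2 LDs)
```

With k classes, lda() produces at most k - 1 discriminants, which is why the Boston model (4 crime classes) has LD1-LD3.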


Chapter 5: Dimensionality reduction techniques

Setting up:

library(dplyr)
library(ggplot2)
library(GGally)
library(FactoMineR)
library(tidyr)
setwd("data/")
human_data <- read.csv("human.csv",row.names=1)

1:

  • Variables:
    • edu_ratio: Ratio of female and male populations with secondary education in each country (i.e. proportion of females / proportion of males)
    • labr_ratio: Ratio of female and male labour force participation rates (i.e. proportion of females / proportion of males)
    • exp_edu_yrs: Expected years of schooling
    • life_exp: Life expectancy at birth
    • gni: Gross National Income per capita
    • mat_mor_rat: Maternal mortality ratio
    • adol_brate: Adolescent birth rate
    • parl_rep_pct: Percentage of female representatives in parliament
summary(human_data)
##    edu_ratio        labr_ratio      exp_edu_yrs       life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       gni          mat_mor_rat       adol_brate      parl_rep_pct  
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The scales and units of the variables are quite different from each other - e.g. edu_ratio ranges from 0.1717 to 1.4967, while gni ranges from 581 to 123124. Gni is also quite an interesting variable, since its third quartile is 24512 but its maximum is 123124.

ggpairs(human_data, lower=list(continuous=wrap("points",size=0.8))) + theme(axis.text.y=element_text(size=7), axis.text.x=element_text(angle=45, hjust=1, size=7),panel.grid.major=element_blank(),axis.ticks=element_blank(),panel.border=element_rect(linetype="solid", colour="black",fill=NA),strip.text=element_text(size=6))

Some of the variables (e.g. gni and mat_mor_rat) are more or less skewed to the right. Other variables (e.g. labr_ratio and life_exp) are skewed to the left, but not as strongly as the right-skewed ones.

There are quite strong correlations between some variables. For example, life_exp and exp_edu_yrs have a strong positive correlation (0.789). Also, adol_brate and mat_mor_rat correlate strongly with various variables; in the case of mat_mor_rat these are mostly negative correlations. Not surprisingly, correlations are strong between variables that are “logically” connected to each other - e.g. life_exp has a quite strong positive correlation with gni and a strong negative correlation with mat_mor_rat.

2

Principal component analysis:

pca_human_data <- prcomp(human_data)

The variability captured by the principal components:

temp <- summary(pca_human_data)
pca_human_pct <- round(100*temp$importance[2,],2)
pca_human_pct
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00

Principal component 1 captures almost all (99.99%) of the variance.
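The percentages in `importance[2,]` are simply each component's variance (sdev squared) divided by the total variance; this can be verified on any prcomp fit, here on the built-in iris data:

```r
# "Proportion of Variance" in summary(prcomp(...)) equals sdev^2 / total variance
pca  <- prcomp(iris[, 1:4])
prop <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * prop, 2)  # percentage of variance captured per component
```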

labels <- paste0(names(pca_human_pct), " (", pca_human_pct, "%)")

The biplot:

biplot(pca_human_data, cex=c(0.6,0.7), col=c("gray47","red"), xlab=labels[1], ylab=labels[2]) 

Figure 1: A biplot of the non-standardized data, where PC1 explains 99.99% of the variation. In this analysis Gross National Income seems to be the most important variable, and the data essentially reduces to the differences between countries in Gross National Income.

3

Standardizing the data:

human_data_st <- scale(human_data)

The variability captured by the principal components in the standardized data:

pca_human_data_st <- prcomp(human_data_st)
temp_st <- summary(pca_human_data_st)
pca_human_pct_st <- round(100*temp_st$importance[2,],1)
pca_human_pct_st
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3

Creating axis labels for the biplot:

labels_st <- paste0(names(pca_human_pct_st), " (", pca_human_pct_st, "%)")

Creating biplot with the standardized data:

biplot(pca_human_data_st, cex=c(0.6,0.7), col=c("gray47","red"), xlab=labels_st[1], ylab=labels_st[2])

Figure 2: A biplot of the standardized data. In the plot there are three groups of variables: Labour ratio and Percentage of female representatives in parliament contribute to PC2, while the rest of the variables contribute to PC1 (in two opposite directions). Combined, these components capture 69.8% of the variance in the data.

The plots are very different from each other. The first figure shows that PC1 captures 99.99% of the variation. The most important variable in building this component seems to have been Gross National Income, and the component is basically based on that variable. We could probably use only one axis (or even just one variable, gni) to represent the non-standardized data and not lose much information about the dataset.

The first two components (PC1: 53.6%, PC2: 16.2%) of the PCA on the standardized data together capture 69.8% of the variation in the data. From Figure 2 we can find three groups of variables that have a relatively strong relation to the components: labr_ratio and parl_rep_pct contribute to PC2 and the other variables to PC1.

In the non-standardized dataset the variables had very different units and scales - e.g. the already mentioned difference between gni and edu_ratio. Hence, in the first PCA far too much weight was given to gni.
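This effect is easy to reproduce on made-up data: when one column's scale dwarfs the others, unscaled PCA loads almost entirely on that column, while standardizing weights the columns equally (the column names here are illustrative):

```r
# Unscaled vs. scaled PCA when one column has a much larger scale
set.seed(1)
demo_df <- data.frame(small = rnorm(50),
                      big   = 1000 * rnorm(50))
pc1_share <- function(p) p$sdev[1]^2 / sum(p$sdev^2)
pc1_share(prcomp(demo_df))                 # close to 1: "big" dominates PC1
pc1_share(prcomp(demo_df, scale. = TRUE))  # close to 0.5: equal weighting
```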

4:

In the standardized data, PC1 (which accounts for 53.6% of the variance) seems to consist of two groups of variables. The first group contains Expected years of schooling, Life expectancy at birth, Gross National Income per capita and the ratio of female and male populations with secondary education. The second group consists of Maternal mortality ratio and Adolescent birth rate. These groups correlate negatively with each other. PC1 resembles the division of countries into developed and developing ones - i.e. long education and high life expectancy imply that a country is developed, while a high maternal mortality ratio implies that it is developing.

The second component seems to consist of variables that measure gender equality, since its elements are the labour force ratio and the percentage of females in parliament. A surprising thing is that the education ratio is more connected to PC1.

As a summary, we have two components: one that shows the variation in the degree of “development” and another that shows how (gender-)equal a country is.
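The grouping of variables into components can also be read numerically from the rotation (loadings) matrix rather than from the biplot arrows; a small sketch on the built-in iris data:

```r
# Variables with large absolute loadings on a component are the ones
# that component "consists of"
pca <- prcomp(iris[, 1:4], scale. = TRUE)
round(pca$rotation[, 1:2], 2)  # loadings of the first two components
```

The same call on the standardized human data (`pca_human_data_st$rotation`) would show labr_ratio and parl_rep_pct loading mainly on PC2, matching the reading of Figure 2.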

5:

The tea dataset looks like this:

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 

Visualization in three parts (to improve the readability):

gather(tea[1:12]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust= 1, size=8))

gather(tea[13:24]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1, size=8))

gather(tea[25:36]) %>% ggplot(aes(value)) + facet_wrap("key", scales="free") + geom_bar() + theme(axis.text.x=element_text(angle=45, hjust=1, size=8))

From the documentation we find out that the tea dataset consists of 300 answers to a “questionnaire on tea”. There were three groups of questions: 1) how the person consumed tea (18 questions), 2) “what are their product’s perception” (12 questions), and 3) personal details (4 questions). In addition, there are two extra variables in the data. In the following Multiple Correspondence Analysis I will use the variables that are about the “perception of tea” (columns 25-36) and two variables that tell us about the respondents: sex and age_Q.

First I will create a new data set with the variables we want to use:

tea_set <- tea[c(20,23,25:36)]

Then MCA-analysis and visualization:

tea_mca <- MCA(tea_set, graph=FALSE)
summary(tea_mca)
## 
## Call:
## MCA(X = tea_set, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.137   0.113   0.098   0.095   0.092   0.085
## % of var.             11.279   9.276   8.056   7.804   7.561   7.035
## Cumulative % of var.  11.279  20.554  28.610  36.414  43.975  51.010
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.076   0.071   0.066   0.062   0.059   0.054
## % of var.              6.225   5.814   5.439   5.120   4.899   4.443
## Cumulative % of var.  57.235  63.048  68.487  73.608  78.507  82.950
##                       Dim.13  Dim.14  Dim.15  Dim.16  Dim.17
## Variance               0.049   0.047   0.040   0.038   0.034
## % of var.              4.005   3.883   3.297   3.101   2.765
## Cumulative % of var.  86.954  90.837  94.134  97.235 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  0.649  1.025  0.275 | -0.331  0.323  0.071 |
## 2                    |  0.199  0.096  0.030 |  0.374  0.415  0.106 |
## 3                    | -0.015  0.001  0.000 | -0.244  0.175  0.067 |
## 4                    |  0.307  0.229  0.082 | -0.419  0.520  0.154 |
## 5                    |  0.269  0.176  0.059 |  0.038  0.004  0.001 |
## 6                    |  0.599  0.873  0.311 | -0.617  1.125  0.330 |
## 7                    |  0.592  0.854  0.241 | -0.581  0.998  0.232 |
## 8                    | -0.251  0.153  0.062 | -0.516  0.786  0.264 |
## 9                    |  0.605  0.890  0.254 | -0.376  0.419  0.098 |
## 10                   |  0.241  0.141  0.051 | -0.440  0.574  0.172 |
##                       Dim.3    ctr   cos2  
## 1                    -0.641  1.402  0.269 |
## 2                    -0.586  1.170  0.259 |
## 3                    -0.582  1.154  0.381 |
## 4                     0.357  0.435  0.112 |
## 5                    -0.022  0.002  0.000 |
## 6                    -0.240  0.197  0.050 |
## 7                    -0.458  0.714  0.144 |
## 8                    -0.150  0.077  0.022 |
## 9                    -0.108  0.040  0.008 |
## 10                   -0.486  0.805  0.209 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## F                    |  -0.501   7.759   0.366 -10.458 |  -0.122   0.559
## M                    |   0.731  11.320   0.366  10.458 |   0.178   0.816
## 15-24                |  -0.056   0.050   0.001  -0.643 |  -0.681   9.017
## 25-34                |   0.606   4.410   0.110   5.730 |   0.433   2.729
## 35-44                |  -0.091   0.057   0.001  -0.615 |  -0.513   2.223
## 45-59                |  -0.186   0.365   0.009  -1.621 |   0.712   6.534
## +60                  |  -0.572   2.163   0.047  -3.768 |   0.260   0.544
## escape-exoticism     |  -0.129   0.408   0.015  -2.107 |   0.307   2.838
## Not.escape-exoticism |   0.115   0.366   0.015   2.107 |  -0.276   2.551
## Not.spirituality     |   0.133   0.634   0.039   3.407 |   0.028   0.034
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## F                      0.022  -2.546 |  -0.048   0.099   0.003  -0.996 |
## M                      0.022   2.546 |   0.070   0.144   0.003   0.996 |
## 15-24                  0.205  -7.830 |   0.458   4.698   0.093   5.267 |
## 25-34                  0.056   4.087 |   0.652   7.146   0.127   6.164 |
## 35-44                  0.040  -3.477 |  -0.493   2.371   0.037  -3.347 |
## 45-59                  0.129   6.218 |  -0.796   9.399   0.162  -6.950 |
## +60                    0.010   1.713 |  -0.497   2.282   0.036  -3.271 |
## escape-exoticism       0.085   5.041 |   0.270   2.523   0.066   4.429 |
## Not.escape-exoticism   0.085  -5.041 |  -0.243   2.268   0.066  -4.429 |
## Not.spirituality       0.002   0.715 |  -0.376   7.080   0.309  -9.619 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## sex                  | 0.366 0.022 0.003 |
## age_Q                | 0.135 0.332 0.355 |
## escape.exoticism     | 0.015 0.085 0.066 |
## spirituality         | 0.039 0.002 0.309 |
## healthy              | 0.174 0.059 0.082 |
## diuretic             | 0.182 0.110 0.070 |
## friendliness         | 0.143 0.000 0.026 |
## iron.absorption      | 0.055 0.006 0.005 |
## feminine             | 0.435 0.009 0.005 |
## sophisticated        | 0.145 0.038 0.166 |
plot(tea_mca, invisible=c("ind"), habillage="quali",cex=0.8,ylim=c(-1,1),xlim=c(-1,1))

First of all, the dimensions in this plot (Dim1: 11.28% and Dim2: 9.28%) do not catch much of the variation in the data. As a rough interpretation, we can say that the variables that attach “sublime” attributes to tea are relatively close to each other - and the same holds for the opposite attributes. These groups are connected to gender (i.e. F is close to the sublime attributes, M to the opposite ones). It could be interpreted that Dim1 catches the difference in how the genders perceive tea.

An interesting aspect of the plot is the position of the different age groups. Age groups 35-44 and 15-24 resemble each other, but 25-34, which lies between them in age, does not resemble them. There is also a difference between how the groups 45-59 and 60+ perceive tea. It seems that Dim2 has to do with age.

I wondered how the type of tea would appear in relation to the other variables:

tea_set_type <- tea[c(13,20,23,25:36)]
tea_mca <- MCA(tea_set_type, graph=FALSE)
plot(tea_mca, invisible=c("ind"), habillage="quali",cex=0.8,ylim=c(-1.5,1.5),xlim=c(-1.5,1.5))

The biggest surprise is that green tea is not associated with those “sublime” attributes. It also seems that the age groups 25-34 and 15-24 differ from the other age groups. Also, Earl Grey is the tea that the younger groups seem to prefer.


Chapter 6: Analysis of longitudinal data

setwd("data/")
library(ggplot2)
library(dplyr)
library(tidyr)
library(lme4)

rats.long <- read.csv("rats_long.csv",row.names=1,stringsAsFactors=FALSE)
bprs.long <- read.csv("bprs_long.csv",row.names=1,stringsAsFactors=FALSE)

rats.long$ID <- factor(rats.long$ID)
rats.long$Group <- factor(rats.long$Group)
bprs.long$subject <- factor(bprs.long$subject)
bprs.long$treatment <- factor(bprs.long$treatment)

1:

str(rats.long)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ days  : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

The data consists of three groups (Group) of rats (ID) that were on different diets. The weights (weight) of the rats were recorded roughly weekly (Time) over a 9-week period. Group 1 had 8 rats, and groups 2 and 3 had 4 rats each. We want to know whether there is a difference between the groups in how the weights of the rats changed. We use a “quick and dirty” approach to find out if there is any difference, and take a general look at the data with the following plot:

ggplot(rats.long, aes(x=Time, y=weight, group=ID)) +
      geom_line(aes(linetype=Group,colour=Group)) + 
      geom_point(size=1,aes(colour=Group)) + 
      scale_x_continuous(name="Time (days)",  
      breaks=seq(0,60,10)) + 
      scale_y_continuous(name="Weight (grams)")

It seems that in the second group there is a rat whose baseline weight is much higher than that of the other members of the group. It also seems that in groups 2 and 3 the weight increases slightly more than in group 1. The general trend is that the weights increase.

Then we illustrate the tracking phenomenon:

rats.stdprs <- rats.long %>%
               group_by(Time) %>%
               mutate(stdbprs = (weight - mean(weight))/sd(weight) ) %>%
               ungroup()

ggplot(rats.stdprs, aes(x = Time, y = stdbprs, group=ID)) + 
       geom_line(aes(color=Group)) +
       ylim(-2, 2)

In addition, we produce a side-by-side boxplot:

ggplot(rats.long, aes(x=as.factor(Time), y=weight, fill=Group)) + 
      geom_boxplot() 

It seems that each of the groups has an outlier; group 2 especially seems to be highly influenced by one. But in the case of longitudinal data, individual measurements are not that informative on their own, so we use summary measures instead. Here we treat day 1 as the baseline and look at summaries of the rest of the measurements. First we create the summaries and produce a boxplot based on them:

rats.wo.base <- rats.long %>%
                filter(Time>1) %>%
                group_by(Group, ID) %>%
                summarise (mean=mean(weight)) %>%
                ungroup()

ggplot(rats.wo.base, aes(x=Group, y=mean)) +
       geom_boxplot() + 
       scale_y_continuous(name="mean(weight), days 8-64")

Each group has an outlier, which raises the question of whether all of them should be removed. If we believe they would bias further analysis of the data, we should remove them. The problem with this data is that removing the outliers would leave only 3 subjects in groups 2 and 3, which could be considered problematic. To inform this decision, I will produce another side-by-side boxplot with the outliers removed:

# Finding the outliers of the groups: 

temp <- which(rats.wo.base$mean==min(rats.wo.base[1:8,3])) 
temp[2] <- which(rats.wo.base$mean==max(rats.wo.base[9:12,3]))
temp[3] <- which(rats.wo.base$mean==min(rats.wo.base[13:16,3])) 

# The IDs of the rats that should be removed are: 
## [1]  2 12 13
# Creating a new version of the summary data without the outliers: 

rats.wo.base.fixd <- rats.long %>% 
                     filter(Time>1) %>%
                     filter(ID != temp[1] & ID != temp[2] & ID != temp[3]) %>%
                     group_by(Group, ID) %>%
                     summarise(mean=mean(weight)) %>%
                     ungroup()

# Removing the outliers from the long data for the side-by-side boxplot: 

rats.long.fixd <- rats.long %>% 
                  filter(ID != temp[1] & ID != temp[2] & ID != temp[3]) %>%
                  ungroup()

ggplot(rats.long.fixd, aes(x=as.factor(Time), y=weight, fill=Group)) + 
      geom_boxplot() 

Removing the outliers seems to clarify the differences between the groups (although now we have only three rats in groups 2 and 3). Hence, we continue with a new summary plot that excludes the outliers:

ggplot(rats.wo.base.fixd, aes(x=Group, y=mean)) +
       geom_boxplot() + 
       scale_y_continuous(name="mean(weight), days 8-64")

It is somewhat difficult to say much about the differences between the groups from this graph, although it seems that group 2 gained the most weight, group 1 the least, and group 3 is somewhere in between. This can be examined in a more formal way with Analysis of Variance, aka ANOVA:

rats.wo.base.fixd <- rats.wo.base.fixd %>% mutate(baseline=rats.long[c(1,3:11,14:16),4])

rat.fit <- lm(mean~baseline+Group,data=rats.wo.base.fixd)

anova(rat.fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## baseline   1 174926  174926 6630.941 3.216e-14 ***
## Group      2   2065    1033   39.147 3.628e-05 ***
## Residuals  9    237      26                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What we can interpret from the ANOVA is that there are statistically significant differences between the groups of rats (at least in the data without the outliers). This supports what was found in the "eye test", where group 2 seemed to be quite different from the other groups.
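Note that ANOVA only tells us that *some* group means differ, not which pairs. One follow-up would be pairwise comparisons, e.g. Tukey's HSD from base R's stats package. The sketch below uses made-up summary means (illustrative values only, not the real rat data) just to show the mechanics:

```r
# Toy summary data: one mean weight per rat, grouped as in the rat data.
toy <- data.frame(
  mean  = c(262, 255, 265, 258,   # group 1 summary means
            455, 450, 462,        # group 2
            528, 522, 534),       # group 3
  Group = factor(rep(c("1", "2", "3"), times = c(4, 3, 3)))
)
# One-way ANOVA followed by Tukey's honest significant differences,
# which give a confidence interval for each pairwise group difference:
rat.aov <- aov(mean ~ Group, data = toy)
TukeyHSD(rat.aov)
```

In the real analysis the model also includes the baseline, so the Tukey step would need a matching setup; this is only a sketch of how pairwise differences could be inspected.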

2:

str(bprs.long)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : chr  "week0" "week0" "week0" "week0" ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

The data consists of 40 male subjects (subject) belonging to one of two treatment groups (treatment). Each subject was rated on the BPRS scale (bprs) once before treatment and then once a week (week) for eight weeks. What we want to know is whether there is any difference between the treatment groups in how the BPRS score changes during the treatment.

First we ignore the repeated-measures nature of the data and fit a simple linear regression model where bprs is the target variable and week and treatment are the predictors (in other words, we assume that, for example, two bprs scores from different weeks that actually belong to the same subject are independent of each other):

bprs.reg <- lm(bprs~week+treatment,bprs.long)

summary(bprs.reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = bprs.long)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Then a plot of individual bprs values:

ggplot(bprs.long, aes(x=week, y=bprs)) +
        geom_text(aes(label=treatment), size=3) +
        facet_grid(. ~ treatment, labeller = label_both) +
        theme(legend.position="none") +
        scale_y_continuous(limits = c(min(0), max(100)))

This plot does not seem to tell us much about the difference between the treatments; both of them seem to lower the bprs score.

What we can find out from the model is that "time matters", since week's p-value is <0.001. According to the model, a one-unit increase in the week variable decreases the bprs value by 2.27. But this model is unrealistic, since it does not treat the data as repeated measures of the same subjects. This can be fixed with a random intercept model, which "understands" that our data consists of repeated measures of the same subjects. First a visualization of the data interpreted in this way:

ggplot(bprs.long, aes(x=week, y=bprs, 
        linetype = subject)) +
        geom_line() +
        scale_linetype_manual(values=rep(1:10,times=4)) +
        facet_grid(. ~ treatment, labeller = label_both) +
        theme(legend.position="none") +
        scale_y_continuous(limits = c(min(0), max(100)))

Then the model and its summary:

bprs.lemr <- lmer(bprs~week+treatment+(1|subject), data=bprs.long)

summary(bprs.lemr)
## Linear mixed model fit by REML ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: bprs.long
## 
## REML criterion at convergence: 2735.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0468 -0.6765 -0.1345  0.4811  3.4675 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  50.17    7.083  
##  Residual             104.83   10.239  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9468   23.86
## week         -2.2704     0.2090  -10.86
## treatment2    0.5722     1.0792    0.53
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.429       
## treatment2 -0.277  0.000

Unlike the regression model, the random intercept model does not consider the measurements to be independent of the subject. After that we produce a random intercept and random slope model, where the slope of week is also allowed to vary by subject:

bprs.lemr.slp <- lmer(bprs~week+treatment+(week|subject), data = bprs.long, REML = FALSE)

summary(bprs.lemr.slp)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: bprs.long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000

Then we fit a random intercept and random slope model that also includes a week × treatment interaction:

bprs.lemr.slp.intr <- lmer(bprs~week*treatment + (week|subject), data = bprs.long, REML = FALSE)

summary(bprs.lemr.slp.intr)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: bprs.long
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840

Now we have two models (one without interaction and the other with interaction) that we can compare with ANOVA:

anova(bprs.lemr.slp,bprs.lemr.slp.intr)
## Data: bprs.long
## Models:
## bprs.lemr.slp: bprs ~ week + treatment + (week | subject)
## bprs.lemr.slp.intr: bprs ~ week * treatment + (week | subject)
##                    Df    AIC    BIC  logLik deviance  Chisq Chi Df
## bprs.lemr.slp       7 2745.4 2772.6 -1365.7   2731.4              
## bprs.lemr.slp.intr  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1
##                    Pr(>Chisq)  
## bprs.lemr.slp                  
## bprs.lemr.slp.intr    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA comparison implies that there is no statistically significant difference between the models (the p-value is about 0.075), so adding the interaction does not clearly improve the fit.
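The anova() call above is a likelihood-ratio test: the chi-squared statistic is the difference between the two models' deviances, with one degree of freedom for the single extra parameter (the interaction term). As a sanity check, the reported p-value can be reproduced from the reported statistic with base R alone:

```r
# Deviance difference from the table: 2731.4 - 2728.3 ~ 3.17
# (anova() reports the unrounded statistic 3.1712).
chisq <- 3.1712
# Upper-tail chi-squared probability with 1 degree of freedom:
p <- pchisq(chisq, df = 1, lower.tail = FALSE)
round(p, 5)  # matches the 0.07495 in the anova() output
```

Since p > 0.05, we cannot reject the simpler model without the interaction at the conventional 5% level.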

Then let’s create two plots where we can compare the data with the model visually. On the left are the original values and on the right the fitted values:

fit.val <- fitted(bprs.lemr.slp.intr)
bprs.long$fitted <- fit.val

ggplot(bprs.long, aes(x = week, y = bprs, linetype = subject)) +
      geom_line() +
      facet_grid(. ~ treatment, labeller = label_both) +
      theme(legend.position="none") +
      scale_y_continuous(limits = c(min(bprs.long$bprs), max(bprs.long$bprs)), name="BPRS score") +
      scale_x_continuous(name="Time (weeks)")

ggplot(bprs.long, aes(x = week, y = fitted, linetype = subject)) +
      geom_line() +
      facet_grid(. ~ treatment, labeller = label_both) +
      theme(legend.position="none") +
      scale_y_continuous(limits = c(min(bprs.long$fitted), max(bprs.long$fitted)), name="BPRS score") +
      scale_x_continuous(name="Time (weeks)")